ABSTRACT


This paper introduces several data analysis routines
that were designed for interactive use with APL (A Programming
Language) and placed in the APL user library at the
Naval Postgraduate School. Specifically, histogram, density
estimation and probability plotting routines are
explained in detail and demonstrated with actual data. In
addition, the applications and limitations of each of the
routines are explored. Together, the routines give the
general user an extensive tool for analyzing either discrete
or continuous data.
I. INTRODUCTION


The Naval Postgraduate School acquired APL (A Programming
Language) from IBM in 1974. Since that time more and
more students and faculty have become familiar with the
extensive and efficient capabilities of APL and have been
putting these features to good use. With the acquisition of
APL came several extensive library routines that are both
well documented and varied in scope. However, on close
examination of these library routines it was found that
statistics and data analysis were areas where some additions would
be particularly useful.

Because APL manipulates vectors, matrices and arrays
efficiently and easily, it is ideal for use in the
area of data analysis. After a thorough screening
of the existing APL library routines pertaining to
data analysis, it was found that by adding six
data analysis routines to the present library, the Naval
Postgraduate School could enhance its present APL capability
and provide the student and general user with a more varied
and flexible tool for analyzing data.

To this end the purpose of this thesis is (1) to describe
completely the six data analysis routines added to the
APL library, (2) to explain the features and capabilities of
each of the routines and (3) to demonstrate the use of each
of the routines with "real world data".

The data used in this paper come from two different
sources. The first source of data was tests performed
jointly by IBM Germany and the German Public Telephone
Network on errors in transmission of binary data on telephone
lines (Lewis & Cox, 1966). From this source two sets of data
are used, and each data set contains the times between errors
in binary bits transmitted over telephone lines. The first
data set contains 672 elements (times between errors: actually
the number of bits between errors) and will hereafter be referred
to as "telephone data 1". The second data set contains 736
elements and will be referred to as "telephone data 2". The
second source of data was percent overrun or
underrun on selected military contracts during the year 1950
(Dixon, 1973). This data set contains 22 elements and will
be referred to as "cost overrun data".

II. HISTOGRAM ROUTINE


A.  DESCRIPTION 

The first routine to be presented is the histogram routine,
which is used for estimating from given data the probability
density function f(x) of a continuous random variable.
The current APL library has several small histogram
routines that are general in nature but lack the overall detail
necessary for good data analysis. For this reason HIST
(histogram routine) was created. HIST is an adaptation
and modification of the FORTRAN library version of
HISTG/F, which was developed at N.P.S. by D. R. Robinson
under the guidance of Professor P.A.W. Lewis. By modifying
and adapting HISTG/F to APL the power and efficiency of
the APL language could be put to full use.

A complete description of how HIST operates is contained
in the variable HISTHOW. If the user's APL workspace
is properly loaded (see Section IX.B for workspace
loading procedures) all that is necessary is to type
HISTHOW. The user then receives the following printed response
on the terminal:

HISTHOW
SYNTAX HIST

HIST ALLOWS YOU TO INTERACTIVELY OBTAIN A HISTOGRAM OF
YOUR DATA ALONG WITH A SET OF BASIC DESCRIPTIVE STATISTICS.
IN ADDITION, HIST HAS THE FOLLOWING CAPABILITIES WHICH ALLOW
YOU:

(1) THE OPTION OF A TITLE FOR YOUR HISTOGRAM

(2) THE OPTION OF DISPLAYING A SMOOTHED EMPIRICAL DENSITY
    FUNCTION OVER THE HISTOGRAM

(3) THE OPTION OF SCALING AND SELECTING THE NUMBER OF
    CELLS FOR YOUR HISTOGRAM

(4) THE OPTION OF SELECTING AN INTERVAL AND PERFORMING A
    HISTOGRAM ON ALL THE DATA POINTS OR CONDITIONALLY
    SELECTING AN INTERVAL IN THE RANGE OF THE DATA.

(5) THE OPTION OF HAVING YOUR OUTPUT APPEAR ON THE
    OFFLINE PRINTER OR ON YOUR TERMINAL


WHEN YOU TYPE HIST YOU WILL BE ASKED TO DO THE FOLLOWING:

(1) ENTER YOUR DATA IN VECTOR FORM - YOU CAN TYPE YOUR DATA
    IN SINGLY OR YOU CAN TYPE THE NAME OF A VARIABLE THAT
    HAS YOUR DATA IN IT. YOU MUST ENSURE THAT YOU HAVE AT
    LEAST 10 DATA POINTS IN YOUR VECTOR AND THAT THERE IS
    SOME DIFFERENCES IN THE DATA POINTS (MAX SIZE OF INTEGER
    VECTOR IS APPROX. 2500 , MAX SIZE OF REAL VECTOR IS
    2000 ). AFTER YOU HAVE ENTERED YOUR DATA YOU WILL BE
    ASKED

(2) IF YOU DESIRE A SMOOTHED EMPIRICAL DENSITY FUNCTION OR
    NOT. THE EMPIRICAL DENSITY FUNCTION WHEN PLOTTED GIVES
    ESSENTIALLY A MORE EXACT PICTURE OF THE DATA THAN DOES
    THE HISTOGRAM ALONE, ALTHOUGH THIS FEATURE IS SLIGHTLY
    BLURRED BY THE PRECISION WHICH CAN BE OBTAINED WITH THE
    APL BALL (THE APL FINE PLOT IS NOT PRESENTLY AVAILABLE
    ON THE NPS SYSTEM). THE SMOOTHED EMPIRICAL DENSITY
    IS DEFINED BY THE RELATION (LEWIS, LIU, ROBINSON, AND
    ROSENBLATT; ROSENBLATT)

                     1         N
        F(Z) = ------------   SUM   W((Z - X(I)) ÷ B(N))
                 N × B(N)     I=1

    WHERE N IS THE NUMBER OF DATA POINTS, B(N) IS A BANDWIDTH
    FUNCTION,

        B(N) = RANGE ÷ SQRT(N)

    AND W IS A WEIGHT FUNCTION,

        W(Z) = 0        IF |Z| > 1
             = 1 - |Z|  OTHERWISE

    F(Z) IS COMPUTED FOR VALUES OF Z BETWEEN THE MAXIMUM
    AND THE MINIMUM OF THE SAMPLE AND PLOTTED OVER THE
    HISTOGRAM USING THE SYMBOL -F-. THE RELATIVE FREQUENCY
    MARKS ON THE LEFT OF THE OUTPUT REFER TO THE HISTOGRAM,
    AND NOT TO THE DENSITY FUNCTION. AFTER THIS QUERY YOU
    WILL BE ASKED

(3) IF YOU DESIRE TO TITLE YOUR HISTOGRAM. IF YOU ELECT TO
    TITLE YOUR HISTOGRAM, SIMPLY TYPE YOUR TITLE, ENSURING
    THAT YOUR TITLE IS MORE THAN ONE CHARACTER IN LENGTH.
    IF NO TITLE IS DESIRED JUST HIT THE CARRIAGE RETURN.
    AFTER THE TITLE QUERY YOU WILL BE ASKED

(4) IF YOU WANT TO SET YOUR OWN SCALE AND THE NUMBER OF
    CELLS. YOUR RESPONSE MUST BE A VECTOR OF 3 ELEMENTS.
    THE FIRST ELEMENT IS THE NUMBER OF CELLS YOU DESIRE,
    THIS MUST BE AN INTEGER BETWEEN 10 AND 28 , THE
    SECOND ELEMENT IS THE LEFT SCALE POINT AND THE THIRD
    ELEMENT IS THE RIGHT SCALE POINT (HIST DOES NOT REQUIRE
    THAT YOUR INTERVAL BE DIVISIBLE BY THE NUMBER OF CELLS).
    IF YOU WANT HIST TO AUTOMATICALLY SCALE AND PICK THE
    CELLS YOU SHOULD TYPE THE VECTOR 0 0 0 . AFTER YOU
    HAVE SELECTED YOUR SCALING TECHNIQUE YOU WILL BE ASKED

(5) IF YOU WANT DATA POINTS NOT INSIDE THE SCALE LIMITS
    INCLUDED IN THE HISTOGRAM ROUTINE. MOST HISTOGRAMS LUMP
    DATA POINTS THAT FALL OUTSIDE THE SCALE LIMITS IN THE
    END CELLS. HOWEVER, HIST GIVES YOU THE OPTION OF
    INCLUDING THEM OR EXCLUDING THEM, I.E. OF OBTAINING A
    HISTOGRAM FOR THE CONDITIONAL DENSITY. AFTER YOUR
    RESPONSE TO THIS QUERY YOU WILL BE ASKED

(6) IF YOU WANT YOUR OUTPUT TO APPEAR ON THE OFFLINE PRINTER
    OR ON YOUR TERMINAL. IF YOU SELECT THE OFFLINE PRINTER
    THE NEXT RESPONSE YOU WILL RECEIVE ON YOUR TERMINAL IS
    - HISTOGRAM SENT TO PRINTER -. THIS RESPONSE WILL TAKE
    SEVERAL SECONDS AND AFTER IT IS RECEIVED YOUR TERMINAL
    IS FREE FOR FURTHER USE. HOWEVER, IF YOU ELECTED TO
    HAVE YOUR HISTOGRAM PRINTED ON YOUR TERMINAL THE
    PRINTING WOULD BEGIN IN JUST A FEW SECONDS BUT WOULD
    TAKE BETWEEN 5 AND 10 MINUTES TO COMPLETE.


THE FOLLOWING BASIC DESCRIPTIVE STATISTICS ARE COMPUTED
AND PRINTED OUT BY HIST.

    MEAN, MEDIAN, TRIMEAN, MIDMEAN, MODE

    GEOMETRIC AND HARMONIC MEANS (POSITIVE SAMPLES ONLY)

    VARIANCE, STANDARD DEVIATION, COEFFICIENT OF VARIATION,
    RANGE AND MIDSPREAD

    THIRD AND FOURTH CENTRAL MOMENTS, COEFFICIENTS OF
    SKEWNESS AND KURTOSIS

    MAXIMUM, MINIMUM AND 5 SAMPLE QUANTILES

IN ADDITION, THE MEAN IS DISPLAYED ON THE HISTOGRAM BY A
VERTICAL COLUMN OF -M- AND THE QUARTILES BY COLUMNS OF
DOTS.

DESCRIPTION OF OUTPUT

THE DEFINITIONS OF THE BASIC STATISTICS COMPUTED BY HIST
ARE LISTED BELOW. PAGE NUMBER REFERENCES ARE TO THE CRC
STANDARD MATH TABLES, 19TH EDITION (1971).

MEAN        AVERAGE OF THE SAMPLE (P 554).

MEDIAN      MID-VALUE OF THE SAMPLE, IF THERE ARE AN ODD
            NUMBER OF SAMPLE POINTS, OR THE AVERAGE OF THE TWO
            MIDDLE VALUES FOR AN EVEN NUMBER OF POINTS (P 555).

SAMPLE      THE Q(1)=.25, Q(2)=.50, AND Q(3)=.75 POPULATION
QUARTILES   QUARTILES ARE THE SOLUTION TO THE EQUATION
            PROB {X ≤ X(Q(I))} = Q(I) , I=1,2,3 . THE SAMPLE
            QUARTILES, WHICH ESTIMATE THE POPULATION QUARTILES,
            ARE THE JTH ORDERED VALUE IN THE SAMPLE, WHERE
            J = ⌊Q(I)×N⌋ + 1 , WHERE N = SAMPLE SIZE.

TRIMEAN     0.25 × (Q(1) + 2Q(2) + Q(3)), WHERE THE Q(I) ARE
            THE QUARTILES.

MIDMEAN     THE AVERAGE OF ALL THE SAMPLE VALUES BETWEEN THE
            UPPER AND LOWER QUARTILES.

MODE        THE DATA POINT THAT OCCURS MOST OFTEN (IF ALL THE
            DATA POINTS ARE DIFFERENT OR IF THERE ARE MORE
            THAN 300 DATA POINTS THE MODE WILL NOT BE PRINTED.
            IF TWO OR MORE MODES OCCUR HIST WILL PRINT THE
            FIRST MODE.)

MIDRANGE    AVERAGE OF THE MAXIMUM AND MINIMUM.

GEOMETRIC   (P 554).
MEAN

HARMONIC    (P 555).
MEAN

VARIANCE    (P 557). UNBIASED ESTIMATORS FOR VARIANCE AND
            STANDARD DEVIATION ARE USED.

STANDARD    (P 557).
DEVIATION

COEFFICIENT OF VARIATION = STANDARD DEVIATION ÷ |MEAN| . WHEN
            THE MEAN IS LESS THAN 1E-30, THE COEFFICIENT OF
            VARIATION IS SET TO ZERO.

MEAN        (P 556). THE AVERAGE OF THE SUM OF THE ABSOLUTE
DEVIATION   DIFFERENCES BETWEEN THE SAMPLE VALUES AND THE
            MEDIAN.

RANGE       MAXIMUM - MINIMUM (P 557).

MIDSPREAD   Q(3) - Q(1) , ALSO CALLED THE INTERQUARTILE
            DISTANCE.

M3          THIRD CENTRAL MOMENT. UNBIASED ESTIMATOR IS USED.
            (P 558)

M4          FOURTH CENTRAL MOMENT. UNBIASED ESTIMATOR IS USED.
            (P 558)

COEFFICIENT OF SKEWNESS    M3 ÷ (STD DEV)*3

COEFFICIENT OF KURTOSIS    ( M4 ÷ (STD DEV)*4 ) - 3

BETA1       BIASED ESTIMATE OF THIRD CENTRAL MOMENT. CAN BE
            USED IN TESTING FOR NORMALITY. (BIOMETRIKA TABLES
            FOR STATISTICIANS).

BETA2       BIASED ESTIMATE OF FOURTH CENTRAL MOMENT.
            (BIOMETRIKA TABLES FOR STATISTICIANS).

MAXIMUM     LARGEST SAMPLE VALUE.

MINIMUM     SMALLEST SAMPLE VALUE.

SAMPLE      THE α-QUANTILE, X(α), IS THE SOLUTION TO THE EQ.
QUANTILES   PROBABILITY {X ≤ X(α)} = α .

With  this  complete  description  the  general  user  should 
be  able  to  take  full  advantage  of  HIST  and  put  to  use  all 
its  options. 
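
The quartile-based definitions printed by HISTHOW can be illustrated with a short sketch. This is not the APL source of HIST; it is a Python rendering of the stated rules, and the function names are hypothetical. It assumes the J-th ordered value with J = floor(Q(I) x N) + 1 is taken with 1-based indexing, and that "between the quartiles" in the midmean includes the quartile values themselves.

```python
import math

def sample_quartiles(data):
    """Return (Q1, Q2, Q3) as the J-th ordered values, J = floor(q*N) + 1."""
    ordered = sorted(data)
    n = len(ordered)
    quartiles = []
    for q in (0.25, 0.50, 0.75):
        j = math.floor(q * n) + 1          # 1-based rank from HISTHOW
        quartiles.append(ordered[min(j, n) - 1])
    return tuple(quartiles)

def trimean(data):
    """0.25 * (Q1 + 2*Q2 + Q3), as defined in HISTHOW."""
    q1, q2, q3 = sample_quartiles(data)
    return 0.25 * (q1 + 2 * q2 + q3)

def midspread(data):
    """Q3 - Q1, the interquartile distance."""
    q1, _, q3 = sample_quartiles(data)
    return q3 - q1

def midmean(data):
    """Average of the values from Q1 to Q3 (inclusive ends are an assumption)."""
    q1, _, q3 = sample_quartiles(data)
    inner = [x for x in data if q1 <= x <= q3]
    return sum(inner) / len(inner)

sample = list(range(1, 21))                # the integers 1..20
print(sample_quartiles(sample), trimean(sample), midspread(sample))
```

For the integers 1 through 20 the ranks are 6, 11 and 16, so the quartiles coincide with those values.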

B. USAGE WITH TELEPHONE DATA 1 AND TELEPHONE
DATA 2: OFFLINE, ALL DATA, ECDF, AND TITLE

HIST was now used on two sets of data. Both telephone
data 1 and telephone data 2 were first run with the offline
printer, demonstrating the title option, the empirical density
function option, and the option of lumping
any data points outside the designated interval
into the end cells. When HIST was typed the following
responses to each of the queries were entered.

HIST

ENTER DATA IN VECTOR FORM
□:
TELDAT1

IF YOU ALSO WANT A SMOOTHED EMPIRICAL DENSITY FUNCTION ENTER
A 1 . IF YOU DO NOT WANT IT ENTER A 0 .
□:
1

IF YOU WANT TO TITLE YOUR HISTOGRAM TYPE YOUR TITLE.
IF YOU DO NOT WANT A TITLE JUST HIT THE CARRIAGE RETURN.
TELEPHONE DATA 1

IF YOU WANT TO SET THE NUMBER OF CELLS AND THE SCALE ENTER
FIRST THE NUMBER OF CELLS (AN INTEGER BETWEEN 10 AND 28)
FOLLOWED BY A SPACE AND THEN YOUR LEFT SCALE POINT FOLLOWED
BY A SPACE AND THEN YOUR RIGHT SCALE POINT. HOWEVER, IF YOU
WANT HIST TO AUTOMATICALLY SCALE ENTER 0 0 0 .
□:
28 0 20000

GIVEN THAT YOU HAVE SET YOUR OWN SCALE, TO INCLUDE DATA
POINTS THAT MIGHT BE OUTSIDE YOUR SCALE LIMITS IN THE END
CELLS, TYPE 1 . IF YOU DESIGNATED AUTOSCALE ALSO, TYPE
1 . IF HOWEVER, YOU DO NOT WANT THE DATA OUTSIDE THE SCALE
LIMITS INCLUDED IN THE HISTOGRAM, TYPE 0 .
□:
1

IF YOU WANT YOUR OUTPUT TO APPEAR ON THE OFFLINE PRINTER,
TYPE 1 . IF YOU WANT YOUR OUTPUT TO APPEAR ON YOUR
TERMINAL, TYPE 0 . (NOTE IF YOU TYPED 0 BE SURE YOUR
TERMINALS CARRIAGE PAGE SETTING IS ON THE MAXIMUM WIDTH)
□:
1

HISTOGRAM SENT TO PRINTER
Note that telephone data 1 was contained in the variable
TELDAT1 and that the number of cells chosen was 28, with the
left scale point being 0 and the right scale point being
20,000.

After the response - HISTOGRAM SENT TO PRINTER - was
received, HIST was again typed under identical conditions
and telephone data 2 was entered through the variable
TELDAT2.

HIST

ENTER DATA IN VECTOR FORM
□:
TELDAT2

IF YOU ALSO WANT A SMOOTHED EMPIRICAL DENSITY FUNCTION ENTER
A 1 . IF YOU DO NOT WANT IT ENTER A 0 .
□:
1

IF YOU WANT TO TITLE YOUR HISTOGRAM TYPE YOUR TITLE.
IF YOU DO NOT WANT A TITLE JUST HIT THE CARRIAGE RETURN.
TELEPHONE DATA 2

IF YOU WANT TO SET THE NUMBER OF CELLS AND THE SCALE ENTER
FIRST THE NUMBER OF CELLS (AN INTEGER BETWEEN 10 AND 28)
FOLLOWED BY A SPACE AND THEN YOUR LEFT SCALE POINT FOLLOWED
BY A SPACE AND THEN YOUR RIGHT SCALE POINT. HOWEVER, IF YOU
WANT HIST TO AUTOMATICALLY SCALE ENTER 0 0 0 .
□:
28 0 20000

GIVEN THAT YOU HAVE SET YOUR OWN SCALE, TO INCLUDE DATA
POINTS THAT MIGHT BE OUTSIDE YOUR SCALE LIMITS IN THE END
CELLS, TYPE 1 . IF YOU DESIGNATED AUTOSCALE ALSO, TYPE
1 . IF HOWEVER, YOU DO NOT WANT THE DATA OUTSIDE THE SCALE
LIMITS INCLUDED IN THE HISTOGRAM, TYPE 0 .
□:
1

IF YOU WANT YOUR OUTPUT TO APPEAR ON THE OFFLINE PRINTER,
TYPE 1 . IF YOU WANT YOUR OUTPUT TO APPEAR ON YOUR
TERMINAL, TYPE 0 . (NOTE IF YOU TYPED 0 BE SURE YOUR
TERMINALS CARRIAGE PAGE SETTING IS ON THE MAXIMUM WIDTH)
□:
1

HISTOGRAM SENT TO PRINTER

Now by looking at figure 1 (output for telephone data 1)
and figure 2 (output for telephone data 2) the similarities
and differences in the histograms can be compared. Without
getting into specifics, the empirical density function plot
seems to indicate that both sets of data are similar. However,
the times-between-errors equal to one dominate the data; a more
detailed discussion of this data and its analysis is given
in Section VIII.
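
The smoothed empirical density that HIST plots over the histogram can be sketched in a few lines. The code below is an illustrative Python version, not the APL source of HIST. It assumes the relation in HISTHOW is the usual kernel estimate, with the weight function applied to (z - x_i) / B(N), B(N) = range / sqrt(N), and the triangular weight W(z) = 1 - |z| for |z| <= 1. The function names are hypothetical.

```python
import math

def triangular_weight(z):
    """W(z) = 0 if |z| > 1, else 1 - |z|."""
    return 0.0 if abs(z) > 1 else 1.0 - abs(z)

def smoothed_density(data, z):
    """Estimate f(z) = (1 / (N * B)) * sum of W((z - x_i) / B), B = range / sqrt(N)."""
    n = len(data)
    bandwidth = (max(data) - min(data)) / math.sqrt(n)
    total = sum(triangular_weight((z - x) / bandwidth) for x in data)
    return total / (n * bandwidth)

# Evaluate the estimate on a small grid between the sample minimum and maximum.
sample = [0, 1, 2, 3]
for z in (0.0, 1.5, 3.0):
    print(z, round(smoothed_density(sample, z), 4))
```

Because each triangular weight integrates to one after scaling by the bandwidth, the estimate integrates to one over the whole line, which is a quick sanity check on the normalization.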

C. USAGE WITH TELEPHONE DATA 1 AND TELEPHONE DATA 2, ON
LINE; CONDITIONAL DATA BETWEEN 2 AND 140, ECDF, AND TITLE

Because both sets of data have:

(1) a large number of elements,

(2) a large number of times-between-errors equal to 1
(this becomes more apparent when HISTLIST is
described), and

(3) a very extensive range,

it would appear that the conditional option available in
HIST could be used to see whether the two data sets are in fact
similar over a smaller interval. This was done using
the on-line (terminal) output option, the empirical density
function option, the title option and the conditional option
with any data points outside the designated interval excluded
from the histogram.


HIST

ENTER DATA IN VECTOR FORM
□:
TELDAT1

IF YOU ALSO WANT A SMOOTHED EMPIRICAL DENSITY FUNCTION ENTER
A 1 . IF YOU DO NOT WANT IT ENTER A 0 .
□:
1

IF YOU WANT TO TITLE YOUR HISTOGRAM TYPE YOUR TITLE.
IF YOU DO NOT WANT A TITLE JUST HIT THE CARRIAGE RETURN.
TELEPHONE DATA 1 BETWEEN 2 AND 140

IF YOU WANT TO SET THE NUMBER OF CELLS AND THE SCALE ENTER
FIRST THE NUMBER OF CELLS (AN INTEGER BETWEEN 10 AND 28)
FOLLOWED BY A SPACE AND THEN YOUR LEFT SCALE POINT FOLLOWED
BY A SPACE AND THEN YOUR RIGHT SCALE POINT. HOWEVER, IF YOU
WANT HIST TO AUTOMATICALLY SCALE ENTER 0 0 0 .
□:
28 2 140

GIVEN THAT YOU HAVE SET YOUR OWN SCALE, TO INCLUDE DATA
POINTS THAT MIGHT BE OUTSIDE YOUR SCALE LIMITS IN THE END
CELLS, TYPE 1 . IF YOU DESIGNATED AUTOSCALE ALSO, TYPE
1 . IF HOWEVER, YOU DO NOT WANT THE DATA OUTSIDE THE SCALE
LIMITS INCLUDED IN THE HISTOGRAM, TYPE 0 .
□:
0

IF YOU WANT YOUR OUTPUT TO APPEAR ON THE OFFLINE PRINTER,
TYPE 1 . IF YOU WANT YOUR OUTPUT TO APPEAR ON YOUR
TERMINAL, TYPE 0 . (NOTE IF YOU TYPED 0 BE SURE YOUR
TERMINALS CARRIAGE PAGE SETTING IS ON THE MAXIMUM WIDTH)
□:
0


Note that the same variable TELDAT1 is used but this
time the interval was between 2 and 140. Also, the response -
HISTOGRAM SENT TO PRINTER - was not typed because the
on-line (terminal) output option was employed.

After the output for telephone data 1 was printed HIST
was again typed and telephone data 2 was entered under
identical conditions.

HIST

ENTER DATA IN VECTOR FORM
□:
TELDAT2

IF YOU ALSO WANT A SMOOTHED EMPIRICAL DENSITY FUNCTION ENTER
A 1 . IF YOU DO NOT WANT IT ENTER A 0 .
□:
1

IF YOU WANT TO TITLE YOUR HISTOGRAM TYPE YOUR TITLE.
IF YOU DO NOT WANT A TITLE JUST HIT THE CARRIAGE RETURN.
TELEPHONE DATA 2 BETWEEN 2 AND 140

IF YOU WANT TO SET THE NUMBER OF CELLS AND THE SCALE ENTER
FIRST THE NUMBER OF CELLS (AN INTEGER BETWEEN 10 AND 28)
FOLLOWED BY A SPACE AND THEN YOUR LEFT SCALE POINT FOLLOWED
BY A SPACE AND THEN YOUR RIGHT SCALE POINT. HOWEVER, IF YOU
WANT HIST TO AUTOMATICALLY SCALE ENTER 0 0 0 .
□:
28 2 140

GIVEN THAT YOU HAVE SET YOUR OWN SCALE, TO INCLUDE DATA
POINTS THAT MIGHT BE OUTSIDE YOUR SCALE LIMITS IN THE END
CELLS, TYPE 1 . IF YOU DESIGNATED AUTOSCALE ALSO, TYPE
1 . IF HOWEVER, YOU DO NOT WANT THE DATA OUTSIDE THE SCALE
LIMITS INCLUDED IN THE HISTOGRAM, TYPE 0 .
□:
0

IF YOU WANT YOUR OUTPUT TO APPEAR ON THE OFFLINE PRINTER,
TYPE 1 . IF YOU WANT YOUR OUTPUT TO APPEAR ON YOUR
TERMINAL, TYPE 0 . (NOTE IF YOU TYPED 0 BE SURE YOUR
TERMINALS CARRIAGE PAGE SETTING IS ON THE MAXIMUM WIDTH)
□:
0


Figure 3 (output from telephone data 1 between 2 and
140) and figure 4 (output from telephone data 2 between 2
and 140) now appear quite different in shape, based on the
empirical density function plot. The earlier similarity arose
from the extensive range of the data (85,993 for telephone
data 1 and 67,271 for telephone data 2) and the large number
of times-between-errors equal to one. Both sets of data are
actually discrete, occurring only at multiples of 1, but as
an initial analysis the data sets were treated as continuous.
Thus, by employing the conditional option available in HIST,
differences in the two sets of data become quite apparent,
whereas before the differences were not so easily detected.
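
The difference between lumping out-of-range points into the end cells and excluding them (the conditional histogram) can be sketched as follows. This is an illustrative Python version rather than the HIST source; the function name is hypothetical, and the convention that a point on an interior cell boundary falls in the cell to its right is an assumption.

```python
def cell_counts(data, n_cells, left, right, include_outside):
    """Count data points per histogram cell on [left, right].

    include_outside=True lumps points outside the scale limits into the
    end cells; include_outside=False drops them (conditional histogram).
    """
    width = (right - left) / n_cells
    counts = [0] * n_cells
    for x in data:
        if x < left or x > right:
            if not include_outside:
                continue                       # conditional option: skip point
            cell = 0 if x < left else n_cells - 1
        else:
            # The right scale point itself belongs to the last cell.
            cell = min(int((x - left) / width), n_cells - 1)
        counts[cell] += 1
    return counts

sample = [-5, 0, 9, 10, 29, 30, 100]
print(cell_counts(sample, 3, 0, 30, include_outside=True))
print(cell_counts(sample, 3, 0, 30, include_outside=False))
```

Relative frequencies for the conditional histogram would then be taken over the retained points only, which is why the two options can give visibly different shapes.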

[Figure output: TELEPHONE DATA 1 BETWEEN 2 AND 140. Statistics panel headings: CENTRAL TENDENCY, SPREAD, HIGHER CENTRAL MOMENTS, DISTRIBUTION]


III.  LISTING  ROUTINE 


A.  DESCRIPTION 

The second routine presented is a listing routine. APL
has a function that will automatically sort the data and
print the results. However, the unique feature of HISTLIST
(listing routine) is that it takes advantage of like
occurrences in the data and prints the data in ascending order
in a compressed form. This becomes highly useful when
listing a large number of data points that contain multiple
occurrences. It is also a tool for finding multiplicities in
supposedly continuous data, and a probability function
estimating routine for data which is known to be discrete.

A complete description of how HISTLIST operates is
contained in the variable HISTLISTHOW. When the user types
HISTLISTHOW the following response is printed on the
terminal:

HISTLISTHOW

SYNTAX HISTLIST

HISTLIST IS A HIGHLY CONVENIENT WAY TO LIST YOUR DATA.
HISTLIST TAKES YOUR DATA, ORDERS IT AND COMPRESSES IT. FOR
EXAMPLE, IF THREE DATA POINTS WERE ALL THE SAME VALUE
HISTLIST WOULD JUST PRINT THE VALUE ONCE AND THEN PRINT THE
NUMBER OF OCCURENCES OF THAT VALUE. HISTLIST WILL ALSO
PRINT THE SERIAL NUMBER OF THE DATA, THE PERCENTAGE THIS
SAMPLE VALUE IS TO THE WHOLE SAMPLE, AND A SMALL HISTOGRAM
(STARS) SHOWING RELATIVE PERCENTAGES. EXAMPLE: 6 4 4 3 4

HISTLIST

SER. NUM.   ORDERED DATA   NUMBER OF OCCURENCES   PER CENT
    1             3                  1               .20
    2             4                  3               .60
    5             6                  1               .20

HISTLIST IS IDEALLY SUITED FOR A LARGE SAMPLE THAT COULD
POSSIBLY HAVE A LOT OF LIKE OCCURENCES. HISTLIST FURTHER HAS
THE ADVANTAGE OF BEING USED WITH EITHER THE OFFLINE PRINTER
OR THE USERS TERMINAL.


B. USAGE WITH TELEPHONE DATA 1 AND TELEPHONE DATA 2 OFFLINE

HISTLIST was used with the title option and offline
printer option on both telephone data 1 and telephone data 2.
When HISTLIST was typed the following responses to each of
the queries were entered.

HISTLIST

HISTLIST PRINTS THE SERIAL NUMBER OF THE COMPRESSED
DATA, THE ORDERED DATA COMPRESSED, AND THE NUMBER OF
LIKE OCCURENCES. ENTER YOUR DATA IN VECTOR FORM.
□:
TELDAT1

IF YOU WANT TO TITLE YOUR DATA TYPE YOUR TITLE.
IF YOU DO NOT WANT A TITLE JUST HIT THE CARRIAGE
RETURN.
TELEPHONE DATA 1

IF YOU WANT YOUR OUTPUT TO APPEAR ON THE OFFLINE
PRINTER TYPE 1 . IF YOU WANT YOUR OUTPUT TO APPEAR
ON YOUR TERMINAL TYPE 0 .
□:
1

HISTLIST SENT TO PRINTER


After the response - HISTLIST SENT TO PRINTER - was
received HISTLIST was again typed and telephone data 2 was
entered.


HISTLIST

HISTLIST PRINTS THE SERIAL NUMBER OF THE COMPRESSED
DATA, THE ORDERED DATA COMPRESSED, AND THE NUMBER OF
LIKE OCCURENCES. ENTER YOUR DATA IN VECTOR FORM.
□:
TELDAT2

IF YOU WANT TO TITLE YOUR DATA TYPE YOUR TITLE.
IF YOU DO NOT WANT A TITLE JUST HIT THE CARRIAGE
RETURN.
TELEPHONE DATA 2

IF YOU WANT YOUR OUTPUT TO APPEAR ON THE OFFLINE
PRINTER TYPE 1 . IF YOU WANT YOUR OUTPUT TO APPEAR
ON YOUR TERMINAL TYPE 0 .
□:
1

HISTLIST SENT TO PRINTER


Looking  at  figure  5 (output  with  telephone  data  1)  and 
figure  6 (output  with  telephone  data  2)  the  listings  of  the 
two  data  sets  can  be  compared.  It  can  be  seen  that  both 
telephone  data  1 and  telephone  data  2 contain  a large  number 
of  multiple  occurrences  of  the  number  one  and  the  number  two. 
In  fact  19%  of  telephone  data  1 is  the  number  one  and  24% 
of  telephone  data  2 is  the  number  one.  Also,  telephone  data 
2 has  many  more  multiple  occurrences  in  the  120  to  130  range 
than  telephone  data  1.  This  was  quickly  apparent  when  one 
looked  at  the  stars  to  the  right  of  the  ordered  data. 

[FIGURES 5B, 5C, 6A, 6B, 6C: HISTLIST output for telephone data 1 and telephone data 2]

In addition, HISTLIST saved on printing time and paper.
By printing the data in compressed form HISTLIST saved
printing 448 lines (6 additional pages) in the case of
telephone data 1 and 419 lines (5 additional pages) in the case
of telephone data 2. Thus, HISTLIST not only gives the
user more information than an ordered listing of the data,
but is also cost effective in terms of printing time and
paper used. Finally, note that it is not possible to look
at the data in as much detail with routine HIST as with
HISTLIST. If the data is continuous and there are no
multiplicities, then HISTLIST gives only this information and
an ordered listing of the data. The shape of the density
function can best be seen (estimated) by using routine HIST.
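
The compression that HISTLIST performs can be sketched briefly. The code below is a Python illustration of the behavior described in HISTLISTHOW, not the APL routine itself, and the function name is hypothetical. Each row holds the serial number of the first occurrence in the ordered data, the value, the number of like occurrences, and the fraction of the sample.

```python
from collections import Counter

def compressed_listing(data):
    """Return rows of (serial number, value, occurrences, fraction of sample)."""
    counts = Counter(data)
    n = len(data)
    rows = []
    serial = 1                       # position of the value in the ordered data
    for value in sorted(counts):
        k = counts[value]
        rows.append((serial, value, k, k / n))
        serial += k                  # skip past the repeated values
    return rows

rows = compressed_listing([6, 4, 4, 3, 4])
for serial, value, k, fraction in rows:
    print(serial, value, k, fraction, "*" * k)
print("lines saved:", 5 - len(rows))
```

The number of printed lines saved is simply the sample size minus the number of distinct values, which is how savings such as those quoted above for the telephone data arise. The one-star-per-occurrence bar is only a stand-in for the routine's own star histogram.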

IV. SECTIONING ROUTINE


A.  DESCRIPTION 

The  third  routine  presented  is  the  sectioning  routine, 
HISTS.  HISTS  (sectioning  routine)  gives  a way  of  assessing 
the  variability  of  estimates  of  descriptive  statistics  from 
sample  data.  It  is  essential  that  the  data  be  in  random 
order. 

The basic idea is as follows. Assume we have $m$ independent
observations $y_1, y_2, \ldots, y_m$ of a random variable $Y$.
The usual estimate of its mean value $\mu = E(Y)$ is the sample
mean $\bar{y}$, where

$$\bar{y} = \frac{1}{m} \sum_{i=1}^{m} y_i .$$

Now $\bar{y}$ is the least-squares estimate of $\mu$, and therefore
unbiased with variance $\mathrm{var}(\bar{y}) = \sigma^2 / m$, where
$\sigma^2 = \mathrm{var}(Y)$. Of course $\sigma^2$ is unknown, but we
can estimate it from the data with the sample variance

$$s^2 = \frac{1}{m-1} \sum_{i=1}^{m} (y_i - \bar{y})^2$$

and then estimate the variance of the estimate $\bar{y}$ of $\mu$ as

$$\widehat{\mathrm{var}}(\bar{y}) = \frac{s^2}{m} = \frac{1}{m(m-1)} \sum_{i=1}^{m} (y_i - \bar{y})^2 .$$

This is the basis for the sectioning routine: here the
$y_i$ are estimates of descriptive statistics from the $m$
sections of the data and $\bar{y}$ is the average of the statistics
from each section. Estimates are assumed independent because
the original data is assumed to be independent.
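
The sectioning idea can be sketched in Python as follows. This is an illustration under stated assumptions, not the APL source of HISTS: the function name and return layout are hypothetical, the data are split into m consecutive sections of equal size with leftover points at the end discarded, and the variability of a statistic is estimated from its m section values using the divisor m - 1.

```python
import math
import statistics

def section_summary(data, m, stat=statistics.mean):
    """Split data into m equal sections and summarize a statistic across them."""
    size = len(data) // m                      # leftover points are discarded
    sections = [data[i * size:(i + 1) * size] for i in range(m)]
    values = [stat(section) for section in sections]
    std_dev = statistics.stdev(values)         # divisor m - 1
    return {
        "section_size": size,
        "values": values,
        "mean": statistics.mean(values),
        "std_dev": std_dev,
        "std_err": std_dev / math.sqrt(m),     # estimated std dev of the mean
    }

# 301 points in 10 sections: 30 points per section, the last point is dropped.
summary = section_summary(list(range(301)), 10)
print(summary["section_size"], summary["mean"], round(summary["std_err"], 3))
```

Any statistic can be passed as `stat`, for example `statistics.median`; the data must be in random order for the section values to behave as independent estimates.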

A complete description of how HISTS operates is contained
in the variable HISTSHOW. When the user types
HISTSHOW the following response is printed on the terminal:

HISTSHOW

SYNTAX HISTS

HISTS ALLOWS YOU TO INTERACTIVELY SECTION YOUR DATA AND
ASSESS THE VARIABILITY IN EACH OF THE DESCRIPTIVE STATISTICS
BY USING THE SECTIONED SAMPLE DATA.

WHEN YOU TYPE HISTS YOU WILL BE ASKED TO DESIGNATE THE
NUMBER OF SECTIONS YOU DESIRE. HISTS WILL THEN TAKE
THE UNORDERED DATA AND DIVIDE THE DATA INTO THE NUMBER
OF SECTIONS YOU INDICATE DISCARDING ANY DATA POINTS LEFT
OVER. FOR EXAMPLE, IF YOU HAVE 301 DATA POINTS AND YOU
SELECT 10 SECTIONS HISTS WILL PLACE THE FIRST 30 DATA POINTS
IN THE FIRST SECTION, THE SECOND 30 DATA POINTS IN THE
SECOND SECTION AND SO ON UNTIL THE LAST DATA POINT IS
OMITTED. YOU WILL NOW HAVE 10 SECTIONS WITH 30 DATA POINTS
PER SECTION.

HISTS WOULD NOW PRINT THE FOLLOWING STATISTICS ON EACH OF
THE SECTIONS: MEAN, MEDIAN, VARIANCE, STD DEV, COEF VAR,
SKEWNESS, KURTOSIS, MINIMUM AND MAXIMUM. IN ADDITION, THE
ABOVE STATISTICS WOULD BE PRINTED FOR THE UNSECTIONED DATA
TO ALLOW FOR COMPARISONS.

FINALLY, HISTS WILL PRINT (1) THE MEAN OF THE SECTIONED
DATA STATISTICS. FOR EXAMPLE, THE MEAN FOR SKEWNESS WOULD BE
EACH SECTION VALUE FOR SKEWNESS SUMMED UP AND DIVIDED BY THE
NUMBER OF SECTIONS. (2) THE VARIANCE AND STD DEV OF THE
SECTIONED DATA STATISTICS. AND, (3) THE STD DEV DIVIDED BY
THE SQUARE ROOT OF THE NUMBER OF SECTIONS, WHICH ESTIMATES
THE STANDARD DEVIATION OF THE STATISTICS.

AS A RESULT, HISTS WILL GIVE YOU AN UNBIASED ESTIMATE OF
THE VARIANCE OF THE SAMPLE MEAN, MEDIAN, VARIANCE, STD DEV,
COEF VAR, SKEWNESS AND KURTOSIS FROM USING THE SAMPLE
VARIANCE OF THE SECTIONED DATA. WITH THIS RESULT, CONFIDENCE
INTERVALS CAN ALSO BE OBTAINED FOR EACH OF THE ABOVE
STATISTICS, IF THE ESTIMATES FROM THE SECTIONS ARE NORMALLY
DISTRIBUTED. HISTS IS BEST SUITED FOR LARGE AND MODERATE SIZED
SAMPLES; FOR SMALL SAMPLES JACKNIFING SHOULD BE CONSIDERED.

B. USAGE WITH TELEPHONE DATA 1

HISTS was now used on telephone data 1 to assess the
variability in the mean, median, variance, standard deviation,
coefficient of variation, skewness and kurtosis. When
HISTS was typed the following responses were entered (see
figure 7).

The  672  data  points  of  telephone  data  1 were  broken  down 
into  16  sections  with  42  data  points  per  section.  Because 
of  this  breakdown  no  data  points  were  discarded. 

The unsectioned statistics printed can be compared with
the values printed by HIST (figure 1) and are in fact the
same. Provided that the estimates are normally distributed
(this can be checked with the normal plots, described later),
confidence intervals for each of the statistics (mean, median,
variance, standard deviation, coefficient of variation,
skewness and kurtosis) based on the t-statistic can be obtained
in the following manner:

$$\bar{y}_n \pm t_{(1-\frac{1}{2}\alpha),\,(m-1)} \; s_{\bar{y}_n}$$

Here $\bar{y}_n$ is the mean of the sectioned data statistics
(obtained from column one under summary for sectioned data);
$s_{\bar{y}_n}$ is the standard deviation of the sectioned data
statistic divided by the square root of the number of sections
(obtained from column four under summary for sectioned data);
$m$ is the number of sections chosen; and $t_{(1-\frac{1}{2}\alpha),\,(m-1)}$
is the $1-\frac{1}{2}\alpha$ quantile of the t-distribution with
$m-1$ degrees of freedom.
FIGURE 7

C. INTERPRETATION OF RESULTS


As an example, a confidence interval for the coefficient
of variation was obtained in the following manner. The mean
value of the coefficient of variation for the 16 sections is
4.1175 (column 1). The standard deviation divided by the
square root of 16 is .31128 (column 4). Using $\alpha = .05$,
the t value with 15 degrees of freedom is 2.131. Thus, the
95% confidence interval for the coefficient of variation for
telephone data 1 is 4.1175 ± (.31128)(2.131) which is
[3.454, 4.781]. Confidence intervals on the six other
statistics could be obtained in the same fashion.
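
The arithmetic of this interval can be checked with a few lines of Python. The function name is hypothetical, and the t value is passed in as a number read from a t table (as in the text) rather than computed.

```python
def t_interval(estimate, std_err, t_value):
    """Two-sided interval: estimate +/- std_err * t_value."""
    half_width = std_err * t_value
    return estimate - half_width, estimate + half_width

# Coefficient of variation, 16 sections, t(.975, 15) = 2.131 from a t table.
low, high = t_interval(4.1175, 0.31128, 2.131)
print(round(low, 3), round(high, 3))
```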

Again note that the use of the variance estimate from
the sectioned data to give confidence intervals is based on
the assumption that the estimates from the sections are
independent and normally distributed. The normality will
depend on the number of observations in each section, which
should be kept large to induce normality. This requirement
conflicts with the need to make the number of sections large
to reduce the variability in the estimate of the variance of
the statistics.

Another  problem  is  that  if  the  number  of  observations  in 
each  section  is  small,  the  estimates  may  be  severely  biased. 
This  effect  can  be  seen  in  figure  7:  note  that  all  of  the  16 
estimates  of  skewness  from  the  sections  are  smaller  than  the 
estimate  7.1531  from  the  unsectioned  data. 
V. JACKNIFE ROUTINE


A.  DESCRIPTION 

The fourth routine presented is the jacknife routine.
HISTJACK (jacknife routine) is another way of assessing the
variability in the estimates from sample data, and also of
reducing bias in estimates of the descriptive statistics.

The jacknife procedure, like the previous sectioning
method, is based on the assumption that an independent and
identically distributed random sample $x_1, x_2, \ldots, x_n$ has
come from a population with an unknown distribution function
$F_X(x)$. If we divide the sample into $r$ groups, with each
group containing the same number of elements, we can obtain
estimates $\hat\theta$ of the descriptive statistics, which we
denote generically as $\theta$, in the same manner as previously
done with the sectioning method. The difference here is that the
descriptive statistics are computed with the $j$th group
deleted, $j = 1, 2, \ldots, r$. We then let $\hat\theta_{(j)}$ be
the result or the descriptive statistic estimate computed with
the $j$th subgroup omitted, and $\hat\theta_{(\mathrm{all})}$ is
the corresponding result or descriptive statistic estimated from
the entire sample (no group omitted). The jacknife pseudo-values
are then computed in the following way:

$$\theta^*_j = r\,\hat\theta_{(\mathrm{all})} - (r-1)\,\hat\theta_{(j)}, \qquad j = 1, 2, \ldots, r$$

Then we define the jacknifed estimator to be:

$$\theta^* = \frac{1}{r} \sum_{j=1}^{r} \theta^*_j$$

The pseudo-values can be used to obtain variance estimates
for $\theta^*$, and to set approximate confidence limits, using
Student's t. The idea is that the pseudo-values will be
approximately independent and possibly normally distributed.

The jacknifed estimator $\theta^*$ is a sample average so we form
an estimate $s_*^2$ of its variance given by the following
relationship (Miller, 1974):

$$s_*^2 = \frac{1}{r(r-1)} \sum_{j=1}^{r} \left(\theta^*_j - \theta^*\right)^2$$

This procedure is particularly useful if the number $n$ of
data points is small, but it must be used with care. Note
that the estimator $\theta^*$ is designed to eliminate a $1/n$
bias term in the estimator $\hat\theta$.
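
A minimal Python sketch of the grouped jacknife follows. It uses the standard pseudo-value definitions (Miller, 1974) and is not the APL listing of HISTJACK. The function name and the sample values are hypothetical, and the sketch assumes that leftover points are dropped before the whole-sample estimate is formed.

```python
import math
import statistics

def jacknife(data, r, stat=statistics.mean):
    """Grouped jacknife with r equal-sized groups.

    Returns (jacknifed estimate, sample variance of the pseudo-values,
    estimated std dev of the jacknifed estimate)."""
    size = len(data) // r
    used = list(data[:size * r])            # discard leftover points
    theta_all = stat(used)
    pseudo = []
    for j in range(r):
        kept = used[:j * size] + used[(j + 1) * size:]   # omit group j
        pseudo.append(r * theta_all - (r - 1) * stat(kept))
    theta_star = statistics.mean(pseudo)
    var_pseudo = statistics.variance(pseudo)
    s_star = math.sqrt(var_pseudo / r)
    return theta_star, var_pseudo, s_star

sample = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 11.0, 14.0]   # hypothetical data
theta_star, var_pseudo, s_star = jacknife(sample, 4)
```

For the mean with equal-sized groups the pseudo-values reduce to the group means, which gives a quick check on the arithmetic.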

A complete description of how HISTJACK operates is
contained in the variable HISTJACKHOW. When the user types
HISTJACKHOW the following response is printed on the
terminal.

HISTJACKHOW

SYNTAX HISTJACK

HISTJACK ALLOWS YOU TO INTERACTIVELY JACKNIFE YOUR DATA
AND ASSESS THE VARIABILITY IN EACH OF THE STATISTICAL
ESTIMATES BY USING THE SAMPLE DATA.

WHEN YOU TYPE HISTJACK YOU WILL BE ASKED TO DESIGNATE THE
NUMBER OF GROUPS YOU DESIRE. HISTJACK WILL TAKE THE
UNORDERED DATA AND DIVIDE THE DATA INTO THE NUMBER OF
GROUPS YOU INDICATE DISCARDING ANY DATA POINTS LEFT OVER.
FOR EXAMPLE, IF YOU HAVE 22 DATA POINTS AND YOU SELECT 7
GROUPS HISTJACK WILL PLACE THE FIRST 3 DATA POINTS IN GROUP
1, THE SECOND 3 DATA POINTS IN GROUP 2, AND SO ON UNTIL THE
LAST DATA POINT IS OMITTED. YOU WOULD NOW HAVE 7 GROUPS
WITH 3 DATA POINTS PER GROUP. IF YOU HAD ELECTED TO DO A
COMPLETE JACKNIFE, THAT IS TYPED 22, YOU WOULD NOW HAVE 22
GROUPS WITH 1 DATA POINT OMITTED PER GROUP.

HISTJACK WOULD NOW PERFORM STATISTICAL COMPUTATIONS USING
THE JACKNIFE PROCEDURE, THAT IS, BY OMITTING ONE GROUP AT A
TIME, STARTING WITH THE FIRST GROUP. HISTJACK WOULD PRINT
THE FOLLOWING STATISTICS: MEAN, MEDIAN, VARIANCE, STD DEV,
COEF VAR, SKEWNESS, KURTOSIS, MINIMUM AND MAXIMUM. IN
ADDITION, THE ABOVE STATISTICS WOULD BE PRINTED FOR THE
UNGROUPED DATA TO ALLOW FOR COMPARISONS. (NOTE, THE COLUMNS
GIVE THE STATISTIC ESTIMATED FROM ALL THE DATA WITH ONE
GROUP MISSING, AND NOT THE PSEUDO-VALUES)

FINALLY, HISTJACK WILL PRINT (1) THE JACKNIFE ESTIMATE
(2) THE SAMPLE VARIANCE OF THE PSEUDO-VALUES DERIVED IN THE
JACKNIFE ESTIMATE (3) AND, THE ESTIMATED STD DEV OF THE
JACKNIFE ESTIMATE DIVIDED BY THE SQUARE ROOT OF THE NUMBER
OF GROUPS.

AS A RESULT, HISTJACK WILL GIVE YOU AN ESTIMATE OF THE
VARIANCE OF THE SAMPLE MEAN, MEDIAN, VARIANCE, STD DEV, COEF
VAR, SKEWNESS AND KURTOSIS USING THE SAMPLE VARIANCE OF THE
JACKNIFED DATA. WITH THIS RESULT, CONFIDENCE INTERVALS CAN
BE OBTAINED FOR EACH OF THE ABOVE STATISTICS, AGAIN ASSUMING
THAT THE PSEUDO-VALUES ARE APPROXIMATELY INDEPENDENT AND
NORMALLY DISTRIBUTED. HISTJACK IS BEST SUITED FOR SMALL
SAMPLES.
B. USAGE WITH TELEPHONE DATA 1

HISTJACK was now used on telephone data 1 to assess the
variability in the mean, median, variance, standard deviation,
coefficient of variation, skewness and kurtosis. When
HISTJACK was typed the following responses were entered
(see figure 8).

The 672 data points were broken down into 16 groups with
42 data points per group. Again, because of this breakdown
no data points were discarded.

The ungrouped statistics printed are again the same
values that were printed by HIST (figure 1). Using the
jacknife method, confidence intervals for each of the
statistics (mean, median, variance, standard deviation,
coefficient of variation, skewness and kurtosis) can be
obtained in the following manner:

$$\theta^* \pm s_* \, t_{(1-\frac{1}{2}\alpha),(r-1)}$$

Here $\theta^*$ is the jacknife estimate of the sample data
(obtained from column one under summary for jacknifed data);
$s_*$ is the jacknife estimate of the standard deviation
divided by the square root of the number of groups (obtained
from column four under summary for jacknifed data); $r$ is
the number of groups chosen; and
$t_{(1-\frac{1}{2}\alpha),(r-1)}$ is the $1-\frac{1}{2}\alpha$
quantile of the t-distribution with $r-1$ degrees of freedom.
The basis for these assertions about the confidence intervals
using the jacknifing technique is asymptotic and great
care must be taken in using them.

FIGURE 8. HISTJACK output for telephone data 1 (16 groups)

C.  INTERPRETATION  OF  RESULTS 

To compare the confidence interval obtained for the
coefficient of variation using the sectioning routine with
that obtained using the jacknife routine the following was
done. The jacknife estimate of the coefficient of variation
for the 16 groups is 4.5053 (column 1). The jacknife estimate
of the standard deviation divided by the square root of
16 is .3894. Using $\alpha = .05$, the t value with 15 degrees
of freedom is 2.131. Thus, the 95% confidence interval for
the coefficient of variation for telephone data 1 is 4.5053
± (.3894)(2.131) which is [3.676, 5.335]. This compares
with the confidence interval of [3.454, 4.781] using the
sectioning routine described in section IV. Likewise,
confidence intervals on the remaining six statistics could be
obtained in a similar manner. Note that the values obtained
for the skewness coefficient from the sections are now not
evidently biased; of the 16 values, 7 have values below the
value 7.1531 for all the data.

D.  USAGE  WITH  COST  OVERRUN  DATA 

To  demonstrate  how  the  complete  jacknife  could  be  used 
and  why  it  is  better  to  use  when  possible,  the  following  was 
done.  The  22  data  points  of  the  cost  overrun  data  were  used 
with  the  jacknife  routine  (HISTJACK).  When  HISTJACK  was 
typed  the  data  was  entered  in  the  variable  YROVR  and  22  was 
typed  as  the  number  of  groups.  By  typing  22,  which  is  the 
same  as  the  number  of  data  points,  a complete  jacknife  was 
done. 
Looking at the output from the complete jacknife (figure
9), the cost overrun data can be studied. One can note that
by using the complete jacknife the mean, median, and variance
of the jacknife estimate (column one under summary for
jacknifed data) are the same value as the ungrouped mean, median
and variance. But, also note that the coefficient of variation
is less than zero which can happen when using the jacknife
technique.
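
The agreement of the complete jacknife with the ungrouped mean, median and variance can be illustrated with a short Python sketch. The function name and the six data values are hypothetical (they are not the cost overrun data), and an even sample size is assumed, as with the 22 cost overrun points; the median result relies on that.

```python
import statistics

def complete_jacknife(data, stat):
    """Leave-one-out jacknife estimate: the average of the pseudo-values
    n*theta_all - (n-1)*theta_(i)."""
    n = len(data)
    theta_all = stat(data)
    pseudo = [n * theta_all - (n - 1) * stat(data[:i] + data[i + 1:])
              for i in range(n)]
    return statistics.mean(pseudo)

sample = [3.0, -1.5, 8.0, 12.5, 0.5, 6.0]    # hypothetical values, even n
for name, stat in [("mean", statistics.mean),
                   ("median", statistics.median),
                   ("variance", statistics.variance)]:
    print(name, stat(sample), complete_jacknife(sample, stat))
```

Ratio statistics such as the coefficient of variation need not be reproduced in this way.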
FIGURE 9

VI. EXPONENTIAL PLOTTING ROUTINE


A.  DESCRIPTION 

The fifth routine presented is an exponential plotting
routine. Routine EXPONP is a way of plotting the data to
see if it "fits" an exponential distribution, and also to
give some indication of what alternative distributions could
be used if the exponential hypothesis is rejected.

A complete description of how EXPONP operates is
contained in the variable EXPONPHOW. When the user types
EXPONPHOW the following response is printed on the terminal.

EXPONPHOW

SYNTAX EXPONP

EXPONP ORDERS THE DATA X(I) AND COMPUTES
THE EMPIRICAL LOG SURVIVER FUNCTION FOR THE DATA.
THAT IS,

$$X_{(I)} \quad \text{VS} \quad \ln\left(1 - \frac{I}{N+1}\right)$$

THE ORDERED DATA IS PLOTTED AGAINST THE LOG SURVIVER
FUNCTION TO SEE IF THERE IS A LINEAR FIT.

EXPONP ALSO ALLOWS YOU TO TITLE YOUR PLOT.
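
The plotting coordinates described in EXPONPHOW can be sketched in Python as follows. This is an illustration, not the APL code of EXPONP: the function name is hypothetical and the plotting position i/(N+1) is assumed.

```python
import math

def log_survivor_points(data):
    """Pair each ordered observation x_(i) with the empirical log survivor
    value ln(1 - i/(N+1)), i = 1, ..., N."""
    xs = sorted(data)
    n = len(xs)
    return [(x, math.log(1 - i / (n + 1))) for i, x in enumerate(xs, start=1)]

points = log_survivor_points([5.0, 1.0, 3.0])   # hypothetical data
for x, y in points:
    print(x, round(y, 4))
```

For exponentially distributed data these points fall near a straight line through the origin with slope equal to minus one over the mean.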

B. USAGE WITH TELEPHONE DATA 1

EXPONP was used with telephone data 1 to see if the
data plotted as a relatively straight line. When EXPONP was
typed the following responses were entered.


EXPONP 

EXPONP ORDERS THE DATA YOU GIVE AND COMPUTES THE
EMPIRICAL LOG SURVIVER FUNCTION FOR THE DATA.
A PLOT OF THE LOG SURVIVER FUNCTION FOR THE DATA
IS THEN PRINTED TO SEE IF THERE IS A LINEAR FIT.

IF YOU WANT TO TITLE YOUR PLOT TYPE YOUR TITLE.
IF YOU DO NOT WANT A TITLE JUST HIT THE CARRIAGE
RETURN.

TELEPHONE DATA 1
ENTER YOUR DATA IN VECTOR FORM
TELDAT1


Looking at figure 10 (plot of telephone data 1 using
EXPONP), it was found that the data did not plot linearly from
the origin, but that the data did appear somewhat linear in
the tail (5,000 to 90,000 range).


C. USAGE WITH RANDOMLY GENERATED EXPONENTIALLY DISTRIBUTED
SAMPLE WITH MEAN SAME AS TELEPHONE DATA 1

As a comparison, EXPONP was used with a randomly generated
exponential sample with the same mean as telephone data
1 (figure 11). As expected, this plot is, within limits of
sample fluctuations, linear from the origin and is in fact what
telephone data 1 would have looked like if the data were truly
exponential. The quantization caused by the coarseness of
the APL type-ball is evident in this plot. The sample size
is 672, but not all these points can be plotted separately.
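
A comparison sample of this kind can be imitated in Python. The seed, the function names and the least-squares line through the origin are assumptions made for illustration; the mean 1548 and sample size 672 are the values quoted for telephone data 1.

```python
import math
import random

def log_survivor_points(data):
    """Ordered data paired with ln(1 - i/(N+1))."""
    xs = sorted(data)
    n = len(xs)
    return [(x, math.log(1 - i / (n + 1))) for i, x in enumerate(xs, start=1)]

def slope_through_origin(points):
    """Least-squares slope of a line forced through the origin."""
    sxy = sum(x * y for x, y in points)
    sxx = sum(x * x for x, _ in points)
    return sxy / sxx

rng = random.Random(1977)            # arbitrary seed, for repeatability only
mean = 1548.0
sample = [rng.expovariate(1 / mean) for _ in range(672)]
slope = slope_through_origin(log_survivor_points(sample))
ratio = -slope * mean                # close to 1 for exponential data
print(slope, ratio)
```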
FIGURE 11 (log survivor function vs. ordered data)


VII.  NORMAL  PLOTTING  ROUTINE 


A.  DESCRIPTION 

The final routine presented is a normal plotting routine.
Routine NORMP is a way of plotting the data to see if it
"fits" a normal distribution. In particular one might want
to look at estimates of descriptive statistics obtained from
sections and groups in routines HISTS and HISTJACK.

A complete description of how NORMP operates is
contained in the variable NORMPHOW. When the user types
NORMPHOW the following response is printed on the terminal.

NORMPHOW

SYNTAX NORMP

NORMP ORDERS THE DATA X(I) AND COMPUTES THE
INVERSE OF THE UNIT NORMAL CUMULATIVE DISTRIBUTION.
THAT IS,

$$X_{(I)} \quad \text{VS} \quad \Phi^{-1}\left(\frac{I}{N+1}\right)$$

THE ORDERED DATA IS PLOTTED AGAINST THE INVERSE OF
THE UNIT NORMAL CUMULATIVE DISTRIBUTION TO SEE
IF THERE IS A LINEAR FIT. NORMP ALSO ALLOWS YOU
TO CONVIENTLY TITLE YOUR PLOT.
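
The coordinates of the normal plot can be sketched in Python using the standard library's inverse normal distribution function (Python 3.8 or later). The function name is hypothetical and the plotting position i/(N+1) is assumed; this is not the APL code of NORMP.

```python
from statistics import NormalDist

def normal_plot_points(data):
    """Pair each ordered observation x_(i) with the unit normal quantile
    at probability i/(N+1)."""
    xs = sorted(data)
    n = len(xs)
    inv_cdf = NormalDist().inv_cdf
    return [(x, inv_cdf(i / (n + 1))) for i, x in enumerate(xs, start=1)]

points = normal_plot_points([12.0, -3.0, 4.5])   # hypothetical data
for x, z in points:
    print(x, round(z, 4))
```

If the data are normal the points lie near a straight line whose intercept and slope reflect the mean and standard deviation.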
B. USAGE WITH COST OVERRUN DATA

NORMP was used with the cost overrun data to see if
the data plotted as a relatively straight line. When NORMP
was typed the following responses were entered.

NORMP

NORMP ORDERS THE DATA YOU GIVE AND COMPUTES THE
INVERSE OF THE UNIT NORMAL CUMULATIVE DISTRIBUTION
FOR THE DATA. A PLOT OF THE INVERSE OF THE
UNIT NORMAL CUMULATIVE DISTRIBUTION VS THE ORDERED
DATA IS THEN PRINTED TO SEE IF THERE IS A
LINEAR FIT.

IF YOU WANT TO TITLE YOUR PLOT TYPE YOUR TITLE.
IF YOU DO NOT WANT A TITLE JUST HIT CARRIAGE
RETURN.

COST OVERRUNS

ENTER YOUR DATA IN VECTOR FORM
□:

YROVR


Note that the cost overrun data was contained in the
variable YROVR. Looking at figure 12 (plot of cost overrun
data using NORMP), it was found that the data did in
fact plot fairly linearly through the range -14 to 26 (formal
tests are available; see Wilk & Gnanadesikan, 1968).


C.  USAGE  WITH  NORMAL  SAMPLE  GENERATED  WITH  MEAN  AND 
VARIANCE  THE  SAME  AS  COST  OVERRUN  DATA 

As a comparison, NORMP was used with a normal sample
with the same mean and variance as the cost overrun data
(figure 13). As expected, this plot is very linear. But
again, this plot is not that much different from that of
figure 12, which lends credence to the hypothesis that the cost
overrun data might in fact be normally distributed.
FIGURE 12. COST OVERRUNS (normal plot vs. ordered data)

FIGURE 13. NORMAL SAMPLE GENERATED WITH SAME MEAN AND VARIANCE AS COST OVERRUN DATA (normal plot vs. ordered data)

D. USAGE WITH COEFFICIENT OF VARIATION DATA OBTAINED
FROM USING SECTIONING ROUTINE

In order to check for normality in the sectioned
estimates obtained from using HISTS (sectioning routine) the
following was done. The 16 coefficient of variation
values obtained from using HISTS with telephone data 1
(column 5, figure 7) were entered as a vector into NORMP.
Figure 14 shows that the plot is marginally linear. This
demonstrates the need for formal tests to verify normality
in the absence of a strictly linear plot (Wilk & Gnanadesikan,
1968).
VIII. THE INDEPENDENCE AND MARKOV CHAIN
HYPOTHESES FOR THE TELEPHONE DATA


The telephone data used in the thesis (Lewis & Cox, 1966)
actually consists of binary bits transmitted over telephone
lines and the information that the bit transmitted at time i,
i = 0,1,2,..., is in error or not. This information is
characterized by a sequence of binary-valued random variables
x(i), i = 0,1,..., where x(i)=1 means that the bit
transmitted at time i is in error, while x(i)=0 means that the
bit transmitted at time i is correctly transmitted.

In telephone data 1 there are 672 ones and 1,105,476
zeros, and a much more compact and equivalent representation
of the data is obtained via the sequence of random variables
y(j), j=1,2,..., where y(j) is one plus the number of
correctly transmitted bits between the (j-1)th and jth bit errors,
with the convention that y(j)=1 if the errors occur on
adjacent transmitted bits, and y(1) is the time from i=0 to the
first incorrectly transmitted bit. The y(j) are called the
times-between-errors.
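
The reduction from the bit sequence x(i) to the times-between-errors y(j) can be sketched in Python. The function name and the eight-bit example are hypothetical; y(1) is taken here as the index of the first error, which is one reading of "the time from i=0".

```python
def times_between_errors(bits):
    """Convert a 0/1 error-indicator sequence x(0), x(1), ... into the
    times-between-errors y(1), y(2), ..."""
    error_times = [i for i, b in enumerate(bits) if b == 1]
    previous = [0] + error_times[:-1]     # y(1) is measured from i = 0
    return [t - p for t, p in zip(error_times, previous)]

bits = [0, 0, 1, 1, 0, 0, 0, 1]           # hypothetical x(i), errors at 2, 3, 7
y = times_between_errors(bits)
print(y)
```

Adjacent errors give y(j)=1, and three correct bits between two errors give y(j)=4, in line with the definition above.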

A null hypothesis for the error structure which could be
examined is that errors occur independently at each bit with
a fixed probability, i.e.

$$P\{x(i) = 1\} = \pi(1), \qquad i = 0, 1, \ldots$$

$$P\{x(i) = 0\} = \pi(0) = 1 - \pi(1), \qquad i = 0, 1, \ldots$$
The y(j)'s then are independent and geometrically
distributed, since

$$P\{y(j)=1\} = P\{j\text{th error at time } i+1 \mid (j-1)\text{th error at time } i\} = \pi(1)$$

$$P\{y(j)=2\} = P\{j\text{th error at time } i+2 \mid (j-1)\text{th error at time } i\} = \pi(1)[1-\pi(1)] = \pi(1)\pi(0)$$

$$P\{y(j)=k+1\} = P\{j\text{th error at time } i+1+k \mid (j-1)\text{th error at time } i\} = \pi(1)[1-\pi(1)]^{k} = \pi(1)[\pi(0)]^{k}$$

Note that, using the geometric series summation formula,

$$\sum_{k=1}^{\infty} P\{y(j)=k\} = \frac{\pi(1)}{1-\pi(0)} = 1$$

$$E[y(j)] = \sum_{k=1}^{\infty} k\,P\{y(j)=k\} = \frac{1}{\pi(1)}$$

Now assume that the Markov structure of the zeros and
ones is described by the transition matrix

$$P = \begin{pmatrix} P(0,0) & P(0,1) \\ P(1,0) & P(1,1) \end{pmatrix} = \begin{pmatrix} \rho + (1-\rho)\pi(0) & (1-\rho)\pi(1) \\ (1-\rho)\pi(0) & \rho + (1-\rho)\pi(1) \end{pmatrix}$$

Here $P(m,n) = P\{x(i+1)=n \mid x(i)=m\}$, and we have
parameterized the chain in terms of the stationary probability
of a one or zero, and a correlation parameter $0 \le \rho < 1$. Note
that there are only two degrees of freedom in the stochastic
matrix, since rows must sum to 1, and there is only one
degree of freedom if the stationary probability $\pi(0) = 1-\pi(1)$
is fixed. Note that the stationary probabilities in the 2
state case are given by

$$\pi(0) = \frac{1 - P(1,1)}{2 - P(0,0) - P(1,1)}, \qquad \pi(1) = \frac{1 - P(0,0)}{2 - P(0,0) - P(1,1)}$$


We now define the runs of ones or zeros, i.e. for $\ell = 0$ or
$\ell = 1$, let

$$T_\ell = \inf\{n \ge 1 : x(i+n) \ne \ell\} - 1,$$

the length of a run of $\ell$'s starting after time i, where the
length can be 0, 1, 2, ....

For example if x(i+1)=1, then the length of runs of
zeros starting after time i is zero, the length of runs of
ones is at least one long. Note that it is possible to talk
of a conditional runs structure, i.e. the length of a run of
ones which is given to start after time i. The run length
is then at least one long.

Now the probability of a run having length at least k is,
using the Markov property,

$$P\{T_\ell \ge k\} = P\{x(i+1) = x(i+2) = \cdots = x(i+k) = \ell\} = \pi(\ell)[P(\ell,\ell)]^{k-1}, \qquad k = 1, 2, \ldots$$

and $P\{T_\ell = 0\} = 1 - \pi(\ell)$.

Thus, the run lengths are geometrically distributed and

$$E[T_\ell] = \sum_{k=1}^{\infty} P\{T_\ell \ge k\} = \frac{\pi(\ell)}{1 - P(\ell,\ell)} = \frac{\pi(\ell)}{(1-\rho)[1-\pi(\ell)]}$$
Note that $\rho = 0$ gives the independence case, and while
the runs of ones or zeros are geometrically distributed for
both the independence and Markov dependent models, the mean
run length is always longer for the Markov dependence, since

$$\frac{\pi(\ell)}{(1-\rho)[1-\pi(\ell)]} > \frac{\pi(\ell)}{1-\pi(\ell)}, \qquad 0 < \rho < 1$$

Thus, we could use the distributional properties of the
runs to (1) check that either hypothesis is tenable or (2)
if so, compare the estimated run lengths with the mean length
$\pi(\ell)/[1-\pi(\ell)]$ predicted by the independence assumption. If
the run lengths are not geometric, then another model must be
postulated.
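
The two-state chain can be checked numerically with a short Python sketch. It assumes the parameterization $P(\ell,\ell) = \rho + (1-\rho)\pi(\ell)$, which is the form implied by the mean run length above; the function names and the values $\pi(1) = .2$, $\rho = .5$ are hypothetical.

```python
def transition_matrix(pi1, rho):
    """Two-state transition matrix with stationary probability pi1 of a one
    and correlation parameter rho (rows indexed by the current state)."""
    pi0 = 1 - pi1
    return [[rho + (1 - rho) * pi0, (1 - rho) * pi1],
            [(1 - rho) * pi0, rho + (1 - rho) * pi1]]

def stationary(P):
    """Stationary probabilities (pi0, pi1) of a two-state chain."""
    d = 2 - P[0][0] - P[1][1]
    return (1 - P[1][1]) / d, (1 - P[0][0]) / d

def mean_run_length(pi_l, rho):
    """Mean (unconditional) run length of symbol l."""
    return pi_l / ((1 - rho) * (1 - pi_l))

P = transition_matrix(0.2, 0.5)
pi0, pi1 = stationary(P)
markov_mean = mean_run_length(0.2, 0.5)
independent_mean = mean_run_length(0.2, 0.0)
print(P, (pi0, pi1), markov_mean, independent_mean)
```

The stationary probabilities recovered from the matrix agree with the ones used to build it, and the mean run length under dependence exceeds the value under independence.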

Note that when this mean time-between-errors is large, as
it is for telephone data 1 (figure 1; E[y(j)] = 1,548), the
discreteness of the time scale can be ignored and the geometric
distribution is indistinguishable from its continuous time
analog, the exponential distribution.

That this approximation of the geometric distribution by
an exponential distribution is valid can be seen from the
fact that there are 672 errors (x(i)'s equal to one) in
1,106,148 transmitted bits, so that an estimate of $\pi(1)$,
which is the maximum likelihood estimate under the independence
hypothesis, is

$$\hat\pi(1) = \frac{\#\,x(i)\text{'s} = 1}{\text{total } \#\text{ bits transmitted}} = \frac{\#\,x(i)\text{'s} = 1}{\#\,x(i)\text{'s} = 1 \; + \; \#\,x(i)\text{'s} = 0}$$

In the present data

$$\hat\pi(1) = \frac{672}{1{,}106{,}148} = .0006075$$
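
The estimate and the closeness of the geometric and exponential survivor functions can be checked in Python. The function names are hypothetical; the counts are those quoted in the text.

```python
import math

errors = 672
total_bits = 1_106_148
pi1_hat = errors / total_bits            # maximum likelihood estimate of pi(1)

def geometric_survivor(k, p):
    """P{y > k} when y is geometric on 1, 2, ... with error probability p."""
    return (1 - p) ** k

def exponential_survivor(t, rate):
    """P{Y > t} for an exponential distribution with the given rate."""
    return math.exp(-rate * t)

k = 1548                                 # about one mean time-between-errors
geo = geometric_survivor(k, pi1_hat)
expo = exponential_survivor(k, pi1_hat)
print(round(pi1_hat, 7), geo, expo)
```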

Now  this  geometric  hypothesis  will  be  examined,  but  it 
is  clear  from  figure  1 that  the  hypothesis  is  not  true.  The 
distribution  is  in  fact  highly  skewed  and  has  been  examined 
by  Lewis  & Cox,  1966. 

An alternative model to independent bit errors is that
the dependence structure is Markovian. One could examine
this hypothesis with time-series methods but a method which
is adaptable for use with the histogram routine and which
examines both the independence and Markov assumptions is to
look at runs of ones and zeros in the x(i). Under both
hypotheses these runs have geometrically distributed lengths.
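
Extracting the conditional runs (runs of length at least one) from a bit sequence can be sketched in Python. The function name and the short example sequence are hypothetical.

```python
from itertools import groupby

def run_lengths(bits, symbol):
    """Lengths of the maximal runs of `symbol` in a 0/1 sequence."""
    return [sum(1 for _ in group)
            for value, group in groupby(bits) if value == symbol]

bits = [0, 0, 1, 1, 0, 0, 0, 1]          # hypothetical x(i)
runs_of_ones = run_lengths(bits, 1)
runs_of_zeros = run_lengths(bits, 0)
print(runs_of_ones, runs_of_zeros)
```

The resulting vectors of run lengths are the kind of data that could then be passed to a histogram routine.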

The alternating conditional runs of ones for telephone
data 1 are shown in figure 15 and the runs of zeros are shown
in figure 16. Also, HISTLIST was used on the conditional
runs; figure 17 shows the runs of ones and figure 18 shows
the runs of zeros.

To test the hypothesis that the runs of ones in telephone
data 1 are geometrically distributed the following was done.
Using figures 15 and 17 the following data was obtained:

MEAN     = 1.235294          # of runs of length 1   = 444
VARIANCE =  .346008          # of runs of length 2   =  81
                             # of runs of length 3   =  15
                             # of runs of length 4   =   1
                             # of runs of length 5   =   2
                             # of runs of length ≥ 6 =   1
FIGURE 15. RUNS OF ONE FOR TELEPHONE DATA

FIGURE 16. RUNS OF ZERO FOR TELEPHONE DATA

FIGURE 18B

FIGURE 18C
If the runs of ones are geometric then

$$\text{prob}\{X = k\} = (1-p)\,p^{k-1}, \qquad k = 1, 2, \ldots$$

Thus, this is the "geometric plus one" distribution, with

$$E[X] = \frac{1}{1-p}, \qquad \text{VAR}[X] = \frac{p}{(1-p)^2}, \qquad C(X) = \frac{\text{VAR}[X]^{.5}}{E[X]} = p^{.5}$$

To find p set E[X] = 1.235294 = 1/(1-p), which gives

p = .1904761

Therefore, if the data is "geometric plus one" then

$$\text{EXPECTED VAR}[X] = .1904761/(.8095239)^2 = .2906572$$

Thus, the expected variance is .2906572 and the observed
variance from HIST is .3460080. Also, the expected coefficient
of variation is

$$\text{EXPECTED } C(X) = (.1904761)^{.5} = .4364356$$

And, the observed coefficient of variation is .4761817.

Therefore, at this point there seems to be a fairly close
agreement between the runs of one and a "geometric plus one"
distribution with p = .1904761.

As further proof a Chi-square test for goodness of fit
was run on the runs. By using the formula

$$\text{prob}\{X = x\} = (1-p)\,p^{x-1} \qquad \text{for } x = 1, 2, 3, 4, 5, \ldots$$
             PROBABILITY   EXPECTED   OBSERVED

P(X=1)   =    .8095239      440.38       444
P(X=2)   =    .1541949       83.88        81
P(X=3)   =    .0293704       15.98        15
P(X=4)   =    .0055943        3.04         1
P(X=5)   =    .0010655         .58         2
P(X≥6)   =    .0002510         .14         1

Note, to use Chi-square not more than 20% of the cells
should have expected frequencies less than 5 and no cell
should have an expected frequency less than one. Therefore,
the above frequencies must be combined into 3 cells.

$$\chi^2 = \sum_{i=1}^{3} \frac{(\text{obs}_i - \text{ex}_i)^2}{\text{ex}_i} = .1562799$$

And, $\chi^2_{.05,2} = 5.99$. Thus the hypothesis that the
runs of one are "geometric plus one" with p = .1904761
cannot be rejected.
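
The goodness-of-fit computation can be reproduced with a short Python sketch. The variable and function names are hypothetical; the three cells are taken to be run lengths 1, 2, and 3 or more, and p is obtained from the observed mean 1.235294 as in the text.

```python
def geometric_plus_one_pmf(x, p):
    """prob{X = x} = (1 - p) * p**(x - 1) for x = 1, 2, ..."""
    return (1 - p) * p ** (x - 1)

mean_run = 1.235294
p = 1 - 1 / mean_run                       # from E[X] = 1/(1 - p)

observed = [444, 81, 15 + 1 + 2 + 1]       # run lengths 1, 2, and 3 or more
n = sum(observed)
expected = [n * geometric_plus_one_pmf(1, p),
            n * geometric_plus_one_pmf(2, p),
            n * p ** 2]                    # prob{X >= 3} = p**2

chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(n, [round(e, 2) for e in expected], round(chi_square, 4))
```

The statistic is far below the tabled 5% critical value, which agrees with the conclusion that the "geometric plus one" hypothesis cannot be rejected for the runs of ones.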

A similar procedure was done with the runs of greater
than one. By using figure 15 the following information can
be obtained:

MEAN      = 1911.27
VARIANCE  = 59,064,970
COEF.VAR. = 4.021082

And, by using the same method as previously done and solving
for p one gets p = .9994767.

$$\text{EXPECTED VAR}[X] = .9994767/(.0005233)^2 = 3{,}651{,}213$$

This expected variance differs greatly from the observed
variance. Also, the expected coefficient of variation is
computed to be

$$\text{EXPECTED } C(X) = (.9994767)^{.5} = .9997383$$

This compares with the observed coefficient of variation of
4.021082. Because of the gross departures of the variance
and the coefficient of variation from the geometric hypothesis,
one can conclude that the runs of length greater than 1 are
not geometrically distributed.
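
The moment comparison used for both sets of runs can be sketched in Python. The function name is hypothetical; the means and observed variance are the values quoted in the text. The computation works from the unrounded p, so the expected variance differs slightly from the figure quoted above.

```python
import math

def geometric_plus_one_moments(mean):
    """Given E[X] = 1/(1-p), return (p, expected variance, expected
    coefficient of variation) of the "geometric plus one" distribution."""
    p = 1 - 1 / mean
    variance = p / (1 - p) ** 2
    coef_var = math.sqrt(p)
    return p, variance, coef_var

p_ones, var_ones, cv_ones = geometric_plus_one_moments(1.235294)
p_long, var_long, cv_long = geometric_plus_one_moments(1911.27)

observed_var_long = 59_064_970
print(p_ones, var_ones, cv_ones)
print(p_long, var_long, cv_long, observed_var_long / var_long)
```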
IX. DOCUMENTATION ON ROUTINES


A. LOCATION IN APL LIBRARY

The descriptions and routines that have been presented
are all available in the APL workspace library 2 DATALFNS.
Provided the user is properly logged on the terminal and in
the APL mode, all that is necessary is to type )LOAD 2
DATALFNS. If the user then types DESCRIBE, a short description
of the six routines presented and instructions on how
to obtain the detailed information that is available in each
of the "HOW" variables would be printed.

B. WORKSPACE LOADING PROCEDURES

Each of the routines was designed to stand alone. That
is, if the user desires just to use HIST, all that is necessary
is to type )COPY 2 DATALFNS HISTGRP into a clear workspace.
HISTGRP contains the principal routine HIST and
only the additional routines necessary for HIST to operate.
Thus, the user does not clutter his workspace with any
unneeded functions. It is this group structure that maintains
the orderliness of the workspace. And, the ability to copy
a particular group into a clear workspace provides more space
for data and executions of the functions.

The following is the group structure in library 2
DATALFNS.


GROUP                        PRINCIPAL   OTHER NECESSARY            VARIABLES
                             ROUTINE     ROUTINES

HISTGRP                      HIST        APLNAME, APLOT, AUTOS,
                                         CMS, DFT, ECDF, ECODE,
                                         EFT, OF, OUT, WRITE

HISTLISTGRP                  HISTLIST    APLNAME, CMS, ECODE,
                                         DFT, OF, OUT, WRITE

HISTSGRP                     HISTS       DFT, EFT

HISTJACKGRP                  HISTJACK    DFT, EFT, TOT

EXPONPGRP                    EXPONP      AND, AUTOSCALE,
                                         INITIAL, MPLOT, MSGS,
                                         VS, MULTIPLOT, SETAAP,
                                         TICMARK

NORMPGRP                     NORMP       AND, AUTOSCALE,            BS
                                         INITIAL, MPLOT, MSGS,
                                         VS, MULTIPLOT, SETAAP,
                                         TICMARK

DESCGRP (Descriptive group)                                         DESCRIBE, HISTHOW,
                                                                    HISTSHOW, HISTLISTHOW,
                                                                    HISTJACKHOW,
                                                                    EXPONPHOW, NORMPHOW

VARIGRP (Variable group)                                            TELDAT1, TELDAT2,
                                                                    YROVR

C.  ROUTINE  LISTING 

The above mentioned routines were either created by the
author, adapted from the existing FORTRAN routine HISTG/F, or
borrowed from the current APL library to supplement the
author-created routines.

1. Author Created Routines

HISTLIST, HISTS, HISTJACK, EXPONP, NORMP, APLOT,
AUTOS, OUT, TOT

2. Adapted from Fortran Library Routine HISTG/F

HIST, ECDF

3. Borrowed Routines to Supplement Author Created
Routines

AND, APLNAME, AUTOSCALE, CMS, DFT, ECODE, EFT,
INITIAL, MPLOT, MSGS, MULTIPLOT, NDTRI, OF, SETAAP, TICMARK,
VS, WRITE
X. COMPUTER LISTING OF ALL ROUTINES